Appendix G — Practice Final Solutions

STAT303-2 (Winter 2023)

Author

Angelica Wang, Naoki Ito, Yida Hao, Victoria Shi, Nayada Tantichirasakul, Radhika Todi, Ally Bardas, Mingyi Gong, Yuyan Zhang, Annabel Skubisz, Karrine Denisova, Hoda Fakhari, Catherine Erickson, Anjali Patel, Elena Cantu and Arvind Krishna.

Published

March 9, 2023

Abstract
These solutions are composed by students of the course STAT303-2 (Winter 2023).

Multiple choice questions

G.1 Potential problems

Presence of which of the following potential problems in a linear regression model may lead to statistically significant variables appearing insignificant?

  1. Multicollinearity

  2. Outliers

  3. Overfitting

Answer: A and B

Explanation:

A) Multicollinearity:

Recall, the estimated variance of the coefficient \(\beta_j\), of the \(j^{th}\) predictor \(X_j\), can be expressed as:

\[\hat{var}(\hat{\beta_j}) = \frac{\hat{\sigma}^2}{(n-1)\hat{var}(X_j)} \cdot \frac{1}{1-R^2_{X_j|X_{-j}}} \hspace{5cm} (1)\]

If the predictor \(X_j\) is collinear with other predictors, \(R^2_{X_j|X_{-j}}\) will be large, which in turn will inflate \(\hat{var}(\hat{\beta_j})\). In other words, multicollinearity inflates the standard errors of the coefficients of the collinear variables. Since the \(t\)-statistic is calculated by dividing the estimated coefficient by its standard error, the \(t\)-statistic shrinks, and the corresponding \(p\)-value increases. Therefore, the hypothesis test loses power to reject the null hypothesis, and statistically significant variables may appear insignificant.

Another way to think about this is that if some predictors are collinear, it can be difficult to separate out the individual effects of these variables on the response, so significant variables may appear insignificant.
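The inflation factor \(\frac{1}{1-R^2_{X_j|X_{-j}}}\) in equation (1) can be computed directly. The following is a minimal sketch on simulated data, not the course's own code; the helper name `variance_inflation` and the simulated predictors are assumptions made for illustration.

```python
import numpy as np

def variance_inflation(X, j):
    """Return 1 / (1 - R^2), where R^2 comes from regressing column j on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])  # intercept + other predictors
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)          # fixed seed, simulated data
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)      # nearly collinear with x1
x3 = rng.normal(size=n)                 # generated independently of x1 and x2
X = np.column_stack([x1, x2, x3])

vifs = [variance_inflation(X, j) for j in range(X.shape[1])]
print(vifs)  # large for the two collinear predictors, close to 1 for the third
```

The variance of the coefficient of a collinear predictor is multiplied by this factor, which is why its standard error grows and its \(t\)-statistic shrinks.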

B) Outliers

Recall, the estimate of error variance is given by:

\[\hat{\sigma}^2 = \frac{RSS}{n-2},\] where \(RSS\) is the residual sum of squares (the denominator \(n-2\) applies to simple linear regression; with \(p\) predictors it is \(n-p-1\)). Outliers result in an increase in \(RSS\), leading to an increase in the estimated error variance \(\hat{\sigma}^2\), which in turn inflates \(\hat{var}(\hat{\beta_j})\). The rest of the explanation follows from the previous explanation on multicollinearity.

C) Overfitting

Overfitting shrinks \(RSS\), which in turn shrinks \(\hat{\sigma}^2\), thereby shrinking \(\hat{var}(\hat{\beta_j})\). Thus, overfitting acts in the opposite direction to what we observe in (A) and (B).

G.2 Potential problems

Classify a data point as influential / outlier / high leverage in a linear regression model, based on the description.

  1. The data point is likely to have a large effect on the model in terms of prediction: Influential point

  2. The data point has the potential to have a large effect on the model in terms of prediction: High leverage point

  3. The data point is likely to inflate the model R-squared: High leverage point that is not influential

  4. The data point is unlikely to have a large effect on the model in terms of prediction: outlier

Explanation:

See the graphics in class presentation on Chapter3_Outliers_high_leverage_influential_points. Think of influential points / high leverage points / outliers as a force (proportional to the residual corresponding to the point) pulling a cantilever beam. Depending on the position from where you pull the cantilever beam, you may move it too much or too little.

A) Influential point (high leverage & outlier): an outlier with respect to both the predictor and the response. It has a large effect on the regression line. As shown in the graphics, influence is higher for more extreme outliers with the same leverage, and for points with higher leverage and a similar outlying distance.

B) High leverage point: Observations with high leverage have an unusual value for the predictor (i.e., they lie outside the domain of most points). A high leverage point has the potential to have a large effect on the regression line. It is cause for concern if the least squares line is heavily affected by just a couple of observations, because any problems with these points may invalidate the entire fit.

C) High leverage point that is not influential: The variance of the response may increase in the presence of high leverage points, since an unusual set of predictor values may correspond to an unusual response, which may increase the total variation. However, as the point is not influential, the increase in the unexplained variation (the squared residual) will not be proportionate to the increase in total variation. As \(R^2\) is one minus the ratio of unexplained variation to total variation, it is likely to increase.

D) Outliers: As shown in the graphics, outliers have a very small effect on prediction.

G.3 Autocorrelation

A linear regression model was developed to predict the number of passengers taking a flight per month. The data consists of number of passengers flying each month from January 1949 to December 1960. The autocorrelation plot below shows the correlation of the residuals with the lagged residuals of the model. Choose the most appropriate option.

  1. The above plot shows the presence of autocorrelation. The 6-month lagged response is the most appropriate lag to be added as a predictor in the model to address autocorrelation

  2. The above plot shows the presence of autocorrelation. The 12-month lagged response is the most appropriate lag to be added as a predictor in the model to address autocorrelation

  3. The above plot shows the presence of autocorrelation. The 1-month lagged response is the most appropriate lag to be added as a predictor in the model to address autocorrelation

  4. The above plot shows the absence of autocorrelation as the plot must have a cyclical pattern in the presence of autocorrelation

  5. The above plot shows the absence of autocorrelation as the one month lagged residual must have the highest correlation with the residual in the presence of autocorrelation

Answer: B

Explanation: As seen in the plot, the residuals are highly correlated (correlation of more than 60%) with the residuals lagged by 12 months. This shows the presence of autocorrelation. To address autocorrelation, the 12-month lagged response will be the most appropriate predictor, as it has the highest correlation with the response. Thus, it will explain the most variation in the response.

A cyclical pattern is not required for autocorrelation. Even if only one of the lagged residuals is highly correlated with the residual, it shows the presence of autocorrelation.
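The correlation between residuals and lagged residuals can be sketched as follows. The residuals here are hypothetical (a 12-month cycle plus small noise), not the residuals of the passenger model, and `lag_correlation` is a helper name introduced only for this illustration.

```python
import math
import random

def lag_correlation(values, lag):
    """Pearson correlation between the series and its copy shifted by `lag`."""
    a, b = values[lag:], values[:-lag]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

random.seed(0)
# Hypothetical residuals for 144 months: a 12-month seasonal cycle plus small noise
residuals = [math.sin(2 * math.pi * t / 12) + random.gauss(0, 0.1) for t in range(144)]

correlations = {lag: lag_correlation(residuals, lag) for lag in range(1, 13)}
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 2))
```

In this simulation the lag with the highest correlation is the candidate lag to add as a predictor, which mirrors the reasoning used to pick the 12-month lag above.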

G.4 Logistic regression (goodness-of-fit)

Which of the following metrics can be used to assess the goodness-of-fit of a logistic regression model?

  1. All of these

  2. LL-Null

  3. Log-Likelihood

  4. Df Model

  5. R-squared

Answer: Log-Likelihood

Explanation: In logistic regression, the response is assumed to follow a Bernoulli distribution, where the probability of success is a function of the predictors and their coefficients (the model parameters). With this assumption, one can compute the joint probability of the observed data as a function of the model parameters. This creates a set of probability distributions (based on different values of the model parameters) that could have generated the data. The algorithm finds the values of the model parameters (the beta coefficients) that maximize the probability of observing the data. This probability is the likelihood, and its logarithm is the log-likelihood. The higher the log-likelihood, the more probable it is to observe the data. Thus, log-likelihood is a way to measure the goodness-of-fit of the model.

LL-Null is the log-likelihood of the model with no predictors (intercept only). This is compared with the log-likelihood of the model with predictors to test if the regression is statistically significant.

Df Model is the number of predictors in the model.

R-squared cannot be used for logistic regression as there are no residuals.
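As a rough illustration of these quantities, the sketch below computes the Bernoulli log-likelihood for a model with predictors and for an intercept-only model. The labels and predicted probabilities are made up, and `log_likelihood` is a helper defined here rather than a library function.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 labels y given predicted probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p))

y = [0, 0, 1, 1]                      # made-up labels
p_model = [0.1, 0.2, 0.8, 0.9]        # made-up predicted probabilities from a fitted model
p_null = [sum(y) / len(y)] * len(y)   # intercept-only model predicts the overall proportion

ll_model = log_likelihood(y, p_model)
ll_null = log_likelihood(y, p_null)
print(ll_model, ll_null)  # the model with predictors has the higher (less negative) value
```

The gap between the two values is what the likelihood ratio test uses to judge whether the predictors improve on the intercept-only model.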

G.5 Logistic regression (threshold probability)

For a logistic regression model, as we increase the decision threshold probability,

  1. None of these

  2. the recall will reduce or stay the same

  3. the ROC-AUC will increase or stay the same

  4. the precision will increase or stay the same

  5. the classification accuracy will increase or stay the same

Answer: B

Explanation: See class slide on the confusion matrix below.

Increasing the threshold probability means that fewer observations are predicted to be positive. Hence, some TP could turn into FN, reducing the recall (this might not happen if there are no actual positives with predicted probabilities between the old and new thresholds). ROC-AUC is independent of the threshold probability. Precision and classification accuracy are not guaranteed to increase: raising the threshold removes both TP and FP from the predicted positives, so precision falls if proportionally more TP than FP are removed, and accuracy falls if more TP turn into FN than FP turn into TN.
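The effect of the threshold on each metric can be checked on a small example. This is a sketch with made-up labels and predicted probabilities; `confusion_metrics` is a helper written for this illustration.

```python
def confusion_metrics(y, p, threshold):
    """Recall, precision and accuracy when p > threshold is classified as positive."""
    pred = [1 if pi > threshold else 0 for pi in p]
    tp = sum(1 for yi, hi in zip(y, pred) if yi == 1 and hi == 1)
    fp = sum(1 for yi, hi in zip(y, pred) if yi == 0 and hi == 1)
    fn = sum(1 for yi, hi in zip(y, pred) if yi == 1 and hi == 0)
    tn = sum(1 for yi, hi in zip(y, pred) if yi == 0 and hi == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    accuracy = (tp + tn) / len(y)
    return recall, precision, accuracy

# Made-up labels and predicted probabilities
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

results = {t: confusion_metrics(y, p, t) for t in (0.2, 0.5, 0.75)}
for t, (recall, precision, accuracy) in results.items():
    print(t, recall, round(precision, 3), accuracy)
```

On this data the recall goes from 1.0 to 0.75 to 0.25 as the threshold rises, while precision and accuracy first increase and then decrease, so neither is monotone in the threshold.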

G.6 Decision threshold probability

Which of the following metrics is independent of the decision threshold probability?

  1. None of these

  2. ROC-AUC

  3. All of these (except the “None of these” option)

  4. Precision

  5. Recall

Answer: ROC-AUC

Explanation: By changing the threshold, the number of points classified as negative and positive may change, and so TP, FP, TN and FN may change. Recall and precision may change as they are based on these counts (TP, FP, TN, and FN). However, ROC-AUC summarizes performance across all thresholds. The ROC curve is a plot of TPR against FPR for all possible thresholds, and ROC-AUC is the area under the ROC curve, so its value is independent of the decision threshold probability.

G.7 Odds

Consider the following logistic regression model:

\[p(x) =\frac{1}{1+e^{-(\beta_0+\beta_1x)}}.\]

Which of the following metrics will depend on the value of x?

  1. Odds ratio when x increases by 2 units

  2. increase in log odds when x increases by 10 units

  3. All of these

  4. Increase in predicted probability when x increases by 1 unit

  5. None of these

Answer: D

Explanation:

\[p(x) =\frac{1}{1+e^{-(\beta_0+\beta_1x)}}\]

\[\implies \log\bigg(\frac{p(x)}{1-p(x)}\bigg) = \beta_0 + \beta_1x\]

\[\implies \log\big(Odds(x)\big) = \beta_0 + \beta_1x\]

When \(x\) increases by ‘c’ units,

\[p(x+c) - p(x) =\frac{1}{1+e^{-(\beta_0+\beta_1(x+c))}}-\frac{1}{1+e^{-(\beta_0+\beta_1(x))}}\]

\[\log\big(Odds(x+c)\big) - \log\big(Odds(x)\big) = \beta_1 c\]

\[\frac{Odds(x+c)}{Odds(x)} = e^{\beta_1 c}\]

We can see that only the increase in predicted probability when \(x\) increases by 1 unit is dependent on \(x\).
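The derivation can be verified numerically. The sketch below uses illustrative coefficient values (\(\beta_0=-1\), \(\beta_1=0.5\), \(c=2\)), which are assumptions and not estimates from any dataset.

```python
import math

def prob(x, b0, b1):
    """Logistic model p(x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x, b0, b1):
    """Odds p(x) / (1 - p(x)) under the same model."""
    p = prob(x, b0, b1)
    return p / (1.0 - p)

b0, b1, c = -1.0, 0.5, 2.0   # illustrative values only

for x in (0.0, 3.0):
    odds_ratio = odds(x + c, b0, b1) / odds(x, b0, b1)
    log_odds_change = math.log(odds(x + c, b0, b1)) - math.log(odds(x, b0, b1))
    prob_change = prob(x + c, b0, b1) - prob(x, b0, b1)
    print(x, round(odds_ratio, 4), round(log_odds_change, 4), round(prob_change, 4))
```

The odds ratio and the change in log odds come out the same at both starting values of \(x\), while the change in predicted probability differs.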

G.8 Precision-recall

We develop a logistic regression model to predict whether someone will pay a loan back or not. Loans are “approved” by us only for those borrowers who are predicted to pay back. The positive class is the borrowers that pay back the loans. What would a recall of 81% mean?

  1. 81% of the borrowers that would pay back the loan are approved by us: Recall = TP/(TP + FN). TP here are those who are [approved by us] who [pay back the loan], while FN are those who were [not approved by us] but actually [pay back the loans]. The denominator is [all who pay back the loan]. Thus, Recall here means: among [all who pay back the loan], 81% are [approved by us].

  2. Of all the loans we approve, 81% pay us back: This is Precision = TP/(TP + FP)

  3. Of all the loans we don’t approve, 81% would not have paid us back if they were given the loan: This is the proportion of negatives correctly predicted - like precision for the negative class

  4. Of all the loans we don’t approve, 19% would not have paid us back if they were given the loan: This is the proportion of negatives incorrectly predicted.

Answer: 81% of the borrowers that would pay back the loan are approved by us.

Explanation: Recall = True Positives/(True Positives + False Negatives).

In this case, True positives are those who got approved and would pay back. False Negatives are those we didn’t approve, but would pay back. Therefore, 81% Recall means 81% of the borrowers that would pay back the loan are approved by us.

G.9 Variable selection

Which of the following algorithms can be used for variable selection?

  1. Lasso

  2. Ridge regression

  3. Forward stepwise selection

  4. Best subset selection

Answer: A,C,D

Explanation: Both lasso and ridge regression are regularized least squares models, in which a shrinkage penalty is added to the ordinary least squares cost function. The shrinkage penalty in ridge regression shrinks the coefficient estimates towards zero, but not exactly to zero, while the shrinkage penalty in lasso tends to set some regression coefficients exactly to zero, leading to a sparse model. Therefore, lasso can be used for variable selection, but not ridge regression.

Forward stepwise and best subset selection are variable selection algorithms that fit multiple models with different combinations/numbers of predictors and choose the best model.
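The difference between the two penalties is easiest to see in a simplified special case: as many observations as predictors, an identity design matrix, and no intercept. Under those assumptions the ridge and lasso estimates have closed forms in terms of the least squares coefficient, sketched below with made-up numbers. This is an illustration of the special case only, not a general solver.

```python
def ridge_estimate(b, lam):
    """Ridge shrinks a least squares coefficient proportionally; it is zero only if b is zero."""
    return b / (1.0 + lam)

def lasso_estimate(b, lam):
    """Lasso soft-thresholds: coefficients with |b| <= lam / 2 become exactly zero."""
    if b > lam / 2:
        return b - lam / 2
    if b < -lam / 2:
        return b + lam / 2
    return 0.0

least_squares = [3.0, -0.4, 0.2, -2.5]   # made-up least squares coefficients
lam = 1.0                                # made-up penalty strength

ridge = [ridge_estimate(b, lam) for b in least_squares]
lasso = [lasso_estimate(b, lam) for b in least_squares]
print(ridge)  # all shrunk toward zero, none exactly zero
print(lasso)  # the two small coefficients are exactly zero
```

Predictors whose lasso coefficient is exactly zero drop out of the model, which is the sense in which lasso performs variable selection.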

G.10 Precision-recall

You are building a facial recognition model to allow people to unlock their phone. If the phone recognizes the person as the authorized user, it will unlock the phone. If it doesn’t recognize the user, it will prompt them to try again or try an alternative method (such as a passphrase). The facial recognition model is a classification model that identifies if the person unlocking the phone is the authorized user (positive response) or not (negative response).

Assume that letting a stranger (unauthorized user) unlock the phone is more risky (or more expensive) than not letting the authorized user unlock the phone.

Which of the following metrics is the most important to optimize in the model?

  1. Precision

  2. Classification accuracy

  3. Recall

  4. ROC-AUC

Answer: Precision

Explanation:

A) Precision: Precision = TP/(TP + FP). Here, FP are unauthorized users who are recognized as the authorized user, and FN are cases where the authorized user is falsely classified as unauthorized. In this case, it’s important to optimize precision because it is more important to reduce the number of FP (strangers being recognized as the authorized user) than to reduce the number of FN (authorized user not being recognized).

B) Classification accuracy: This is incorrect because a model with high accuracy but a high FPR would be unacceptable since it would increase the risk of a stranger unlocking the phone.

C) Recall: This is incorrect because a high recall indicates that many of the positive cases are being detected. However, it does not measure the fraction of unauthorized users that the model identifies as authorized. A high FPR could lead to an unauthorized user unlocking the phone, which is a more expensive mistake than an FN.

D) ROC-AUC: This is incorrect because ROC-AUC does not take into account the cost of the positive and negative classes. It only measures how well the model can distinguish between authorized users and unauthorized users.

G.11 Logistic regression

Consider the following logistic regression model:

\[p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}\]

Assume the threshold probability for classifying observations is 0.5. All observations with predicted probability greater than 0.5 are classified as belonging to class \(y=1\), while others are classified as belonging to class \(y=0\).

Which of the following plots correctly visualizes the predicted class based on \(x_1\) and \(x_2\)?

Answer: D

Explanation:

\(x_1\) will not have an impact on the outcome because its coefficient is 0. When \(x_2>5\), \(p(x)\) will be less than \(0.5\) and the predicted class will be \(y=0\), as the decision threshold probability is 0.5. When \(x_2<5\), \(p(x)\) will be greater than \(0.5\) and the predicted class will be \(y=1\).

G.12 ROC-AUC

In which of the following cases will ROC-AUC be the most appropriate metric to optimize among all the performance metrics we have seen in this course?

  1. There are wide disparities in the cost of false negatives vs. false positives, for example, predicting if the person has a serious disease.

  2. The predicted probabilities will be used to rank observations, instead of classifying them, for example, the Google search engine using the predicted probabilities to rank pages in the decreasing order of relevance to the search query, instead of classifying the observations as ‘relevant’ and ‘not relevant’.

  3. We wish to maximize the overall classification accuracy, for example, predicting if a person will vote for the Democrat or the Republican candidate in the US Presidential elections. Here, you may assume that the cost of false positives is similar to the cost of false negatives.

Answer: (B) only

Explanation:

(A) is incorrect because in cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize the performance metric associated with a higher loss. For example when predicting if the person has a serious disease, a false positive could lead to expensive and unnecessary medical treatment. Conversely, a false negative could result in a delay in diagnosis and treatment, potentially leading to a worse outcome. Thus, we want to prioritize minimizing false negatives. Since ROC-AUC is decision-threshold invariant, it’s not a useful metric for this type of optimization.

(B) is correct. ROC-AUC is scale-invariant. It measures how well predictions are ranked, rather than the absolute values of the predicted probabilities. Check the link.

(C) is incorrect because AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen. However, the overall accuracy changes with change in decision threshold probability. To maximize overall accuracy, we need to find the optimal decision threshold probability.

G.13 Model selection

Which of the following linear model selection methods can be used when number of predictors is greater than the number of observations in linear regression?

  1. Lasso

  2. Ridge regression

  3. Forward stepwise selection

  4. Backward stepwise selection

  5. Best subset selection

Answer: A, B, C

Explanation:

When the number of predictors is greater than the number of observations, then, in the case of ordinary least squares regression, the number of parameters is greater than the number of equations available to estimate those parameters. Thus, there is no unique solution. Also, in equation (1), \(R^2_{X_j|X_{-j}}\) is 1, and so \(\hat{var}(\hat{\beta_j})\) tends to infinity. Thus, it is not possible to fit an ordinary least squares model in this case. As backward stepwise selection begins with a model containing all predictors, it cannot be used, since it is not possible to develop a model with all predictors in this case.

Best subset selection must consider models with all possible combinations/numbers of predictors. However, as a least squares model cannot have more predictors than observations, we cannot develop all possible models, and thus we cannot use best subset selection.

Forward stepwise starts with no predictors and adds one predictor at a time. It is possible to keep adding predictors until the number of predictors (excluding the intercept) is one less than the number of observations. From this set of models, the best model can be chosen based on AIC, BIC, or any goodness-of-fit criteria that accounts for the number of predictors. Thus, it is possible to use forward stepwise selection.

The shrinkage penalty in lasso and ridge regression reduces the variance of the coefficients. Without any penalty, the variance is infinite when the number of predictors is greater than the number of observations. However, as lasso and ridge regression penalize the size of the coefficients, it is possible to obtain a unique solution.

The following is just for your information, but beyond the scope of this course: With lasso, there will be at most as many non-zero predictors as the number of observations. With ridge regression, there may be more non-zero predictors as compared to the number of observations.
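The role of the penalty when there are more predictors than observations can be sketched with simulated data. The dimensions, seed, and penalty value below are arbitrary assumptions, and the ridge fit omits the intercept for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)     # simulated data with more predictors than observations
n, p = 5, 8
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The rank of X (and hence of X'X) is at most n, which is below p here,
# so X'X is singular and least squares has no unique solution.
rank = np.linalg.matrix_rank(X)

lam = 1.0                                         # illustrative ridge penalty
ridge_matrix = X.T @ X + lam * np.eye(p)          # positive definite for any lam > 0
beta_ridge = np.linalg.solve(ridge_matrix, X.T @ y)   # unique ridge coefficients
print(rank, beta_ridge.shape)
```

Adding \(\lambda I\) to \(X^TX\) is what makes the system solvable, so ridge regression returns one coefficient per predictor even though \(p > n\).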

G.14 Goodness-of-fit

Which of the following metrics can be used to compare the goodness-of-fit of models with different number of predictors?

  1. AIC (Akaike Information criterion)

  2. R-squared

  3. Log-Likelihood

  4. Pseudo R-squared

  5. LLR p-value

Answer: AIC

Explanation: AIC is correct because it takes into account both the goodness-of-fit of the model and the complexity of the model (number of predictors).

\[AIC = -2\log L + 2d,\]

where \(L\) is the maximized value of the likelihood function for the estimated model, and \(d\) is the number of predictors. From the above equation, we can see that \(AIC\) penalizes models with more parameters. Therefore, AIC allows for comparison between models with different numbers of predictors and helps to determine which model is the best fit.

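R-squared, log-likelihood, and pseudo R-squared never decrease as predictors are added, while the LLR p-value tends to decrease. These metrics therefore favor the larger model regardless of whether the extra predictors are useful, and thus cannot be used to compare models with different numbers of predictors.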
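A small numerical illustration of the AIC formula follows. The log-likelihoods and predictor counts are made up for two hypothetical models.

```python
def aic(log_likelihood, d):
    """AIC = -2 log L + 2 d, with d the number of predictors."""
    return -2.0 * log_likelihood + 2.0 * d

# Made-up values: the larger model fits slightly better but uses two more predictors
small_model = {"log_likelihood": -100.0, "d": 3}
large_model = {"log_likelihood": -99.5, "d": 5}

aic_small = aic(**small_model)
aic_large = aic(**large_model)
print(aic_small, aic_large)  # 206.0 209.0
```

The larger model has the higher log-likelihood, yet the smaller model has the lower (better) AIC because the gain in fit does not offset the penalty for the two extra predictors.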

G.15 Model selection

Given a set of predictors, which of the following model selection methods guarantees to provide the best ‘least squares’ linear regression model, based on adjusted R-squared?

  1. Best subset selection

  2. Forward stepwise selection

  3. Backward stepwise selection

  4. Linear regression with all the statistically insignificant predictors removed

Answer: A

Explanation: Best subset selection considers every single model possible, while forward and backward selection don’t. Therefore, best subset selection necessarily gives the best model, while stepwise selection methods do not. For example, suppose a model consisting of predictors \(x_2\) and \(x_3\) is the best possible model with regard to adjusted \(R\)-squared, while the model consisting of \(x_1\) is the best one-predictor model. Forward stepwise selection picks \(x_1\) in its first step and afterwards only considers models that contain \(x_1\), so it will fail to identify the best possible model.

Adjusted \(R\)-squared depends on the residuals, among other things. However, the residuals don’t relate directly to statistical significance of predictors. Statistical significance of a predictor implies that the predictor is significantly linearly associated with the response, but it does not determine the variation in response explained by the predictor.

G.16 MSE estimate

Which of the following metrics gives the least biased estimate of MSE (mean squared error) on test data?

  • Leave-one-out cross validation error
  • MSE (mean squared error) on a test dataset (or validation set)
  • K-fold cross validation error, where 1<k<n, where n is the number of observations
  • All of these

Answer: Leave-one-out cross validation error

Explanation: Leave-one-out cross-validation offers two advantages:

  • It provides a much less biased measure of test MSE compared to using a single test set because we repeatedly fit a model to a dataset that contains n-1 observations.
  • It tends not to overestimate the test MSE compared to using a single test set.

Textbook p200 Details: The test MSE gives us an idea of how well a model will perform on data it hasn’t previously seen, i.e. data that wasn’t used to “train” the model.

However, the drawback of using only one testing set is that the test MSE can vary greatly depending on which observations were used in the training and testing sets.

One way to avoid this problem is to fit a model several times using a different training and testing set each time, then calculating the test MSE to be the average of all of the test MSE’s.

Like the validation set approach, LOOCV involves splitting the set of observations into two parts (test & train). However, instead of creating two subsets of comparable size, LOOCV uses a single observation \((x_1,y_1)\) for the validation set, and the remaining observations \(\{(x_2, y_2), \ldots, (x_n, y_n)\}\) for the training set.

The statistical learning method is fit on the \(n - 1\) training observations, and a prediction \(\hat{y_1}\) is made for the excluded observation using its value \(x_1\). Since \((x_1,y_1)\) was not used in the fitting process, \(MSE_1 = (y_1 - \hat{y_1})^2\) provides an approximately unbiased estimate for the test error.
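Repeating this step for every observation and averaging the \(MSE_i\) values gives the LOOCV error. Below is a minimal sketch for simple linear regression on simulated data; the function name `loocv_mse` and the data-generating values are assumptions for illustration.

```python
import numpy as np

def loocv_mse(x, y):
    """Average squared prediction error over n fits, each leaving out one observation."""
    n = len(x)
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i                          # train on all observations except i
        slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
        prediction = intercept + slope * x[i]
        squared_errors.append((y[i] - prediction) ** 2)   # this is MSE_i
    return float(np.mean(squared_errors))

rng = np.random.default_rng(2)                 # simulated data
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=30)
print(loocv_mse(x, y))
```

Each of the \(n\) models is trained on \(n-1\) observations, which is nearly the full dataset; this is the reason the estimate has little bias.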

G.17 Stepwise with categorical variable

Suppose you have a categorical predictor gender in the dataset with 3 distinct values - ‘male’, ‘female’, and ‘other’. Following are three ways to transform this predictor to make it suitable for forward stepwise selection. Which method is likely to provide the best model and which method is likely to provide the worst model, with regard to prediction accuracy on unknown (test) data?

  1. Use the predictor gender as it is for forward stepwise selection.

  2. Convert the predictor to 3 dummy variables - ‘male’, ‘female’, and ‘other’, where each dummy variable has 0s and 1s, depending on the ‘gender’, and use the dummy variables for forward stepwise selection, instead of gender

  3. Replace the values of ‘male’ to 0, ‘female’ to 1, and ‘other’ to 2 in ‘gender’, and then use gender in forward stepwise selection

  4. B is likely to provide the best model and, A or C are likely to provide the worst model

  5. C is likely to provide the best model and A is likely to provide the worst model

  6. C is likely to provide the best model and B is likely to provide the worst model

  7. B is likely to provide the best model and A is likely to provide the worst model

  8. None of these

Answer: B is likely to provide the best model, and A or C are likely to provide the worst model.

Explanation:

B is likely to provide the best model because breaking down a categorical variable into dummy variables allows us to perform stepwise selection for each class in gender. So it gives the algorithm the option to choose among more models than using the predictor gender as it is in the selection.

C is likely to provide a worse model as compared to B since it introduces an unreasonable constraint in the model that holding all other predictors constant, the difference in response between female and male is the same as the difference in response between other and female. In other words, we are converting categorical variables into ordinals which is constraining the model to find the relationship of the predictor in the order of male-female-other even though that may not be the case.

A is likely to provide a worse model as compared to B because it introduces a constraint in the model to either include all genders, or no gender. If only the female gender explains some variation in response, while there is no distinction between male and other, then the model should not be forced to keep the male and other categories. Suppose the other category happens to have very few observations, leading to an unstable coefficient (high standard error); then it may reduce the prediction accuracy.

G.18 ROC-AUC

Which of the following statements regarding ROC-AUC (area under the ROC curve) are true, with regard to a binary classification problem where the classes are named as ‘positive’ and ‘negative’?

  1. ROC-AUC is zero if the model makes random predictions

  2. The larger the ROC-AUC the better the performance of a logistic regression model

  3. ROC-AUC is the probability that the model ranks a random positive observation more highly than a random negative observation, where a higher rank corresponds to a higher predicted probability

  4. The ROC-AUC can never be negative

Answer: B, C, D

Explanation:

A) is incorrect because ROC-AUC is 50% if the model makes random predictions.

B) is correct because AUC ranges from 0 to 1. The larger the ROC-AUC, the better the model is at distinguishing between the classes. ROC-AUC is 1.0 when there is perfect separation of classes based on the predicted probability.

C) is correct because AUC is the probability that a randomly selected positive observation has a higher predicted probability as compared to a randomly selected negative observation.

D) is correct because ROC-AUC is a probability.
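The ranking interpretation in (C) can be computed directly by comparing every positive observation with every negative one. This is a sketch on made-up data; `roc_auc` is a helper defined here, with ties counted as half a pair.

```python
def roc_auc(y, p):
    """Fraction of (positive, negative) pairs in which the positive has the higher score.

    Ties count as half a pair.
    """
    positives = [pi for yi, pi in zip(y, p) if yi == 1]
    negatives = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
print(roc_auc(y, p))                  # 0.75
print(roc_auc(y, [0.5] * len(y)))     # 0.5: identical scores carry no ranking information
```

Because the value is a proportion of pairs, it lies between 0 and 1, and no decision threshold appears anywhere in the calculation.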

G.19 Lasso

The lasso, relative to least squares, is:

  1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance

  2. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance

  3. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias

  4. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias

Answer: B

Explanation:

The cost function for lasso consists of the sum of absolute value of coefficients called the shrinkage penalty or the regularization term. This term helps to reduce the complexity of the model by shrinking the regression coefficients towards zero. Because of this penalty, lasso is less flexible (compared to the least squares), as it restricts the range of possible coefficient values, whereas least squares can assign any value to the coefficients. The decreased flexibility means that lasso has less variance in its predictions and more bias.

The penalty terms in the optimization will lead to bias in estimates, leading to less accuracy in prediction. At the same time, reducing the size of coefficients gives them less variance, increasing the accuracy of prediction. Thus, when the increase in bias is less than the decrease in variance, it can lead to improved prediction accuracy. This is a bias-variance trade-off.

G.20 Computational complexity

Arrange the following model selection methods in increasing order of computational complexity:

Ridge regression, best subset selection, forward stepwise selection

Answer: Ridge regression, forward stepwise selection, best subset selection

Explanation: Best subset selection has the highest computational complexity because it considers all \(2^p\) possible models containing subsets of the \(p\) predictors. It cannot be used in case of even a slightly large number of predictors. For example, in case of 30 predictors, more than a billion models will need to be developed to find the best subset model.

In forward stepwise selection, the total number of models fit with \(p\) predictors is \(\frac{p(p+1)}{2}\). In case of 30 predictors, this will be 465 models.

In ridge regression, we need to fit only one model.
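The model counts for best subset and forward stepwise selection can be checked with a few lines of arithmetic. The function names are introduced only for this illustration.

```python
def best_subset_models(p):
    """Number of subsets of p predictors, including the empty model."""
    return 2 ** p

def forward_stepwise_models(p):
    """Models fit across the p steps: p candidates, then p - 1, ..., then 1."""
    return p * (p + 1) // 2

p = 30
print(best_subset_models(p))        # 1073741824
print(forward_stepwise_models(p))   # 465
```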

G.21 K-fold CV

For optimizing parameters of a model, \(K\)-fold cross validation is preferred over the validation set approach (computing error on a test dataset) because:

  1. Validation dataset may have observations overlapping with the training dataset

  2. Error on validation dataset is likely to be similar to the error on training data, leading to less value addition

  3. \(K\)-fold cross validation is computationally less expensive

  4. Error on the validation dataset can be highly variable

Answer: D

Explanation:

Testing on one validation set is unreliable because the result is highly dependent on the distribution of the data in the test set. By testing on several validation sets, this concern is alleviated to some extent.

\(K\)-fold cross validation is computationally more expensive because it requires multiple training/testing iterations in order to generate reliable results. The dataset is split into \(K\) subsets (folds); the model is trained on \(K-1\) folds and tested on the remaining fold, and this process is repeated \(K\) times so that each fold is used once for testing.

Error on validation dataset is not likely to be similar to the error on training data in scenarios such as overfitting.

The validation dataset approach assigns each observation to either train or test data, but not both.
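The splitting used by \(K\)-fold cross validation can be sketched as below. The helper `k_fold_indices` is written for this illustration and makes consecutive folds without shuffling; in practice the observations are usually shuffled first.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds of nearly equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

n, k = 10, 3
folds = k_fold_indices(n, k)
for i, validation in enumerate(folds):
    training = [j for j in range(n) if j not in validation]   # the other k - 1 folds
    print(i, validation, len(training))
```

Every observation appears in exactly one validation fold, so the averaged error depends far less on a single lucky or unlucky split than the validation set approach does.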

G.22 Binning

Binning of a continuous predictor, and then using the bins as predictors in logistic regression is likely to be useful when:

  1. The proportion of observations belonging to a class has a non-monotonic relationship with the predictor

  2. The proportion of observations belonging to a class has a monotonic relationship with the predictor

  3. The proportion of observations belonging to a class is almost constant for each bin of the predictor

  4. The predictor has low variance

Answer: A

Explanation: A single coefficient for a predictor will only provide predicted probabilities that are either continuously increasing or continuously decreasing with increasing predictor values. Thus, it will model only a monotonic (non-decreasing / non-increasing) relationship with the response. In case of a non-monotonic relationship, binning provides the required flexibility to have predicted probabilities that may increase or decrease as the predictor values increase.

If the proportion of observations belonging to a class is almost constant for each bin of the predictor, then the bins do not explain the variation in the response, and thus are not useful.

If the predictor itself is not varying, then it is unlikely to explain the variation in response, and binning cannot help increase the predictor variance.
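The contrast between a single coefficient and bins can be sketched numerically. The coefficients and bin counts below are hypothetical. The sketch relies on the fact that a logistic regression with only bin indicators as predictors fits each bin's observed class proportion (when those proportions are strictly between 0 and 1).

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def is_monotonic(values):
    """True if the sequence never increases or never decreases."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

# Predicted probabilities from a single-coefficient logistic model over a grid
grid = [i / 10 for i in range(101)]
for slope in (-0.8, 0.8):  # hypothetical coefficients of either sign
    probs = [sigmoid(-1 + slope * (x - 5)) for x in grid]
    print(slope, is_monotonic(probs))  # monotonic for either sign

# Hypothetical U-shaped data: (observations, observations in class 1) per bin
bin_counts = [(50, 30), (50, 10), (50, 5), (50, 10), (50, 30)]

# With bin indicators as predictors, the fitted probability in each bin is
# the observed proportion of class 1 in that bin
bin_probs = [ones / n for n, ones in bin_counts]
print(bin_probs, is_monotonic(bin_probs))
```

The binned probabilities first fall and then rise, a pattern that no single slope coefficient can reproduce.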

Coding questions

G.23 Inference - Logistic regression

Develop a logistic regression model to predict if a patient has a risk of a 10 year coronary heart disease. TenYearCHD = 1 means ‘Yes’ and TenYearCHD = 0 means ‘No’. Use all the available predictors plus only one relevant interaction to answer the question below.

Assuming all other predictors are constant, by what percentage are the odds of male smokers getting diagnosed with heart disease higher than those of female smokers? Round up the answer to the nearest integer. For example, if the odds of male smokers getting diagnosed with heart disease are 50.1% higher than those of female smokers, then enter 51 in the box.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import r2_score
import statsmodels.formula.api as sm
import itertools
import time
train = pd.read_csv('./Datasets/train_heart.csv')
test = pd.read_csv('./Datasets/test_heart.csv')
predictors = list(train.columns)[0:train.shape[1]-1]
#logistic model with all predictors and interaction of "male"&"currentSmoker"
model = sm.logit('TenYearCHD~male*currentSmoker+' + '+'.join(predictors), data = train).fit(disp=0)
model.summary()
Logit Regression Results
Dep. Variable: TenYearCHD No. Observations: 2742
Model: Logit Df Residuals: 2725
Method: MLE Df Model: 16
Date: Fri, 10 Mar 2023 Pseudo R-squ.: 0.1085
Time: 02:35:34 Log-Likelihood: -1031.3
converged: True LL-Null: -1156.8
Covariance Type: nonrobust LLR p-value: 3.307e-44
coef std err z P>|z| [0.025 0.975]
Intercept -7.7337 0.835 -9.266 0.000 -9.369 -6.098
male 0.4546 0.173 2.634 0.008 0.116 0.793
currentSmoker 0.0353 0.203 0.174 0.862 -0.362 0.433
male:currentSmoker 0.0149 0.247 0.060 0.952 -0.469 0.499
age 0.0563 0.008 7.242 0.000 0.041 0.071
education -0.1069 0.058 -1.835 0.067 -0.221 0.007
cigsPerDay 0.0189 0.007 2.586 0.010 0.005 0.033
BPMeds 0.0595 0.284 0.209 0.834 -0.498 0.617
prevalentStroke 0.6269 0.552 1.135 0.256 -0.456 1.709
prevalentHyp 0.2067 0.161 1.281 0.200 -0.109 0.523
diabetes -0.1072 0.386 -0.278 0.781 -0.863 0.649
totChol 0.0014 0.001 1.061 0.288 -0.001 0.004
sysBP 0.0174 0.004 3.904 0.000 0.009 0.026
diaBP -0.0050 0.007 -0.678 0.498 -0.020 0.010
BMI 0.0036 0.015 0.238 0.812 -0.026 0.033
heartRate -0.0030 0.005 -0.608 0.543 -0.013 0.007
glucose 0.0076 0.003 2.974 0.003 0.003 0.013
#Ratio of odds (of having a heart disease) of a male smoker to a female smoker
np.exp(model.params['male']+model.params['male:currentSmoker'])
1.5992822263126012
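For a smoker (currentSmoker = 1), moving from female to male with all other predictors fixed adds the male coefficient and the male:currentSmoker coefficient to the log-odds. The sketch below converts the resulting odds ratio into the requested percentage. It uses the coefficients as rounded in the summary table, so it matches the exact ratio only to about three decimals.

```python
import math

# Coefficients rounded as displayed in the summary table
beta_male = 0.4546
beta_male_smoker = 0.0149  # male:currentSmoker interaction

# For smokers, moving from female to male adds both terms to the log-odds
odds_ratio = math.exp(beta_male + beta_male_smoker)
percent_higher = 100 * (odds_ratio - 1)

print(round(odds_ratio, 3))       # 1.599
print(math.ceil(percent_higher))  # 60
```

The odds of male smokers are therefore about 59.9% higher than those of female smokers, which rounds up to 60.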

G.24 Odds

Are the odds of male non-smokers being diagnosed with heart disease even higher than female smokers, based on the model developed in the previous question?

#Ratio of odds (of having a heart disease) of a male non-smoker to a female smoker
np.exp(model.params['male']-model.params['currentSmoker'])
1.5209571795703307
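Since the ratio is greater than 1, the model indicates that the odds are higher for male non-smokers than for female smokers, holding all other predictors (including cigsPerDay) constant. A minimal sketch of the reasoning, again using the coefficients as rounded in the summary table:

```python
import math

# Coefficients rounded as displayed in the summary table
beta_male = 0.4546
beta_current_smoker = 0.0353

# Log-odds relative to a female non-smoker, all other predictors held constant.
# The interaction term is zero for both groups: it needs male = 1 AND currentSmoker = 1.
log_odds_male_non_smoker = beta_male
log_odds_female_smoker = beta_current_smoker

odds_ratio = math.exp(log_odds_male_non_smoker - log_odds_female_smoker)
print(round(odds_ratio, 2))  # 1.52
print(odds_ratio > 1)        # True
```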

G.25 Tuning threshold probability

Among the options below, which is the maximum threshold probability of classifying observations into classes (TenYearCHD = 1 and TenYearCHD = 0), such that the false negative rate is less than 20% on both the test and train datasets?

#Function to compute confusion matrix and prediction accuracy on training data
def confusion_matrix_train(model,cutoff=0.5):
    # Confusion matrix
    cm_df = pd.DataFrame(model.pred_table(threshold = cutoff))
    #Formatting the confusion matrix
    cm_df.columns = ['Predicted 0', 'Predicted 1'] 
    cm_df = cm_df.rename(index={0: 'Actual 0',1: 'Actual 1'})
    cm = np.array(cm_df)
    # Calculate the accuracy
    accuracy = 100*(cm[0,0]+cm[1,1])/cm.sum()
    fnr = 100*cm[1,0]/(cm[1,0]+cm[1,1])
    return cm_df, accuracy, fnr

#Function to compute confusion matrix, prediction accuracy and false negative rate on test data
def confusion_matrix_test(data,actual_values,model,cutoff=0.5):
    # Predict the probabilities using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins=np.array([0,cutoff,1])
    # Confusion matrix
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    cm_df = pd.DataFrame(cm)
    cm_df.columns = ['Predicted 0','Predicted 1']
    cm_df = cm_df.rename(index={0: 'Actual 0',1:'Actual 1'})
    # Calculate the accuracy and the false negative rate (both in %)
    accuracy = 100*(cm[0,0]+cm[1,1])/cm.sum()
    fnr = 100*cm[1,0]/(cm[1,0]+cm[1,1])
    # Return the confusion matrix, the accuracy and the false negative rate
    return cm_df, accuracy, fnr
print(confusion_matrix_test(test,test.TenYearCHD,model,0.1))
confusion_matrix_train(model,0.1)
(          Predicted 0  Predicted 1
Actual 0        388.0        379.0
Actual 1         18.0        129.0, 56.564551422319475, 12.244897959183673)
(          Predicted 0  Predicted 1
 Actual 0       1091.0       1241.0
 Actual 1         67.0        343.0,
 52.29759299781182,
 16.341463414634145)
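At a threshold of 0.1, the false negative rate is 12.2% on the test data and 16.3% on the train data, so both are below 20%. To compare several candidate thresholds, the same check can be repeated in a loop. The sketch below is a self-contained illustration with made-up labels and probabilities (not the heart data) and hypothetical helper names. It classifies an observation as 1 when its predicted probability is at least the cutoff.

```python
def false_negative_rate(actual, probs, cutoff):
    """Percentage of actual positives predicted as 0 at the given cutoff."""
    positives = [p for a, p in zip(actual, probs) if a == 1]
    false_negatives = sum(1 for p in positives if p < cutoff)
    return 100 * false_negatives / len(positives)

def max_cutoff(datasets, candidates, fnr_limit=20):
    """Largest candidate cutoff whose FNR is below fnr_limit on every dataset."""
    feasible = [c for c in candidates
                if all(false_negative_rate(a, p, c) < fnr_limit for a, p in datasets)]
    return max(feasible) if feasible else None

# Hypothetical labels and predicted probabilities
train_actual = [1] * 10 + [0] * 5
train_probs = [0.90, 0.70, 0.60, 0.45, 0.35, 0.30, 0.25, 0.22, 0.18, 0.12,
               0.40, 0.30, 0.20, 0.10, 0.05]
test_actual = [1] * 5 + [0] * 3
test_probs = [0.80, 0.50, 0.30, 0.16, 0.11, 0.35, 0.15, 0.05]

candidates = [0.05, 0.10, 0.15, 0.20, 0.25]
best = max_cutoff([(train_actual, train_probs), (test_actual, test_probs)], candidates)
print(best)  # 0.1
```

In this made-up example, a cutoff of 0.15 keeps the false negative rate below 20% on the train data but not on the test data, which is why the condition has to be checked on both datasets.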

G.26 Forward stepwise

Use forward stepwise selection to select a logistic regression model for predicting if a patient has a risk of a 10 year coronary heart disease. How many predictors are there in the best model as per the BIC criterion?

def best_sub_plots():
    plt.figure(figsize=(20,10))
    plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})

    # Set up a 1x2 grid so we can look at 2 plots side by side
    plt.subplot(1, 2, 1)

    # Plot the log-likelihood of the best model of each size
    # (stored in the "Rsquared" column of models_best)
    plt.plot(models_best["Rsquared"])
    plt.xlabel('# Predictors')
    plt.ylabel('Log likelihood')

    # BIC of the best model of each size
    bic = models_best.apply(lambda row: row["model"].bic, axis=1)

    plt.subplot(1, 2, 2)
    plt.plot(bic)
    # Plot a red dot to indicate the model with the lowest BIC
    plt.plot(1+bic.argmin(), bic.min(), "or")
    plt.xlabel('# Predictors')
    plt.ylabel('BIC')
#Function to develop a model based on all predictors in predictor_subset
def processSubset(predictor_subset):
    # Fit model on predictor_subset and compute its log-likelihood
    model = sm.logit('TenYearCHD~' + '+'.join(predictor_subset),data = train).fit(disp=0)
    # Note: the name "Rsquared" is kept for compatibility, but it stores the log-likelihood
    Rsquared = model.llf
    return {"model":model, "Rsquared":Rsquared}
#Function to find the best predictor out of p-k predictors and add it to the model containing the k predictors
def forward(predictors):

    # Pull out predictors we still need to process
    remaining_predictors = [p for p in X.columns if p not in predictors]
    
    tic = time.time()
    
    results = []
    
    for p in remaining_predictors:        
        results.append(processSubset(predictors+[p]))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the highest log-likelihood (stored in "Rsquared")
    best_model = models.loc[models['Rsquared'].argmax()]
    
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model
def forward_selection():
    models_best = pd.DataFrame(columns=["Rsquared", "model"])

    tic = time.time()
    predictors = []

    for i in range(1,len(X.columns)+1):    
        models_best.loc[i] = forward(predictors)
        predictors = list(models_best.loc[i]["model"].params.index[1:])

    toc = time.time()
    print("Total elapsed time:", (toc-tic), "seconds.")
    return models_best
X=train.iloc[:,0:train.shape[1]-1]
models_best = forward_selection()
Processed  15 models on 1 predictors in 0.10073018074035645 seconds.
Processed  14 models on 2 predictors in 0.08876228332519531 seconds.
Processed  13 models on 3 predictors in 0.09574484825134277 seconds.
Processed  12 models on 4 predictors in 0.10770988464355469 seconds.
Processed  11 models on 5 predictors in 0.1107032299041748 seconds.
Processed  10 models on 6 predictors in 0.10970640182495117 seconds.
Processed  9 models on 7 predictors in 0.10073137283325195 seconds.
Processed  8 models on 8 predictors in 0.10275673866271973 seconds.
Processed  7 models on 9 predictors in 0.09374809265136719 seconds.
Processed  6 models on 10 predictors in 0.14561104774475098 seconds.
Processed  5 models on 11 predictors in 0.08178091049194336 seconds.
Processed  4 models on 12 predictors in 0.06682133674621582 seconds.
Processed  3 models on 13 predictors in 0.07779192924499512 seconds.
Processed  2 models on 14 predictors in 0.04288506507873535 seconds.
Processed  1 models on 15 predictors in 0.020943880081176758 seconds.
Total elapsed time: 1.3882873058319092 seconds.
best_sub_plots()
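The log-likelihood never decreases as predictors are added, so it cannot be used on its own to choose the model size; BIC adds a penalty that grows with the number of estimated coefficients. The sketch below shows how the best size would be read off. The log-likelihood values are hypothetical (they are not the values from the models fitted here), and the formula is the standard BIC definition, assumed to match the `bic` attribute used in the plotting function.

```python
import math

def bic(log_likelihood, n_obs, n_predictors):
    """BIC = -2*log-likelihood + log(n)*(number of estimated coefficients)."""
    # +1 accounts for the intercept
    return -2 * log_likelihood + math.log(n_obs) * (n_predictors + 1)

n_obs = 2742  # number of training observations reported in the model summary

# Hypothetical log-likelihoods of the best model with 1, 2, ..., 5 predictors
log_likelihoods = [-1100.0, -1070.0, -1050.0, -1048.0, -1047.0]

bics = [bic(llf, n_obs, k) for k, llf in enumerate(log_likelihoods, start=1)]

# Model sizes start at 1, list positions at 0
best_size = 1 + bics.index(min(bics))
print(best_size)  # 3
```

With the fitted models, the same quantity corresponds to the position of the red dot in the BIC plot.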

G.27 Multicollinearity

You are developing a linear regression model to predict ‘SalePrice’. Assume all columns except ‘Id’ and ‘SalePrice’ to be predictors.

What is the minimum number of predictors to be removed from the model so that there is no multicollinearity?

Assume a VIF of less than 15 indicates absence of multicollinearity.

train = pd.read_csv('./Datasets/housing_train.csv')
test = pd.read_csv('./Datasets/housing_test.csv')
predictors = list(train.columns)[1:train.shape[1]-1]
X = train[predictors]
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
def vif(X):
    X = add_constant(X)
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns

    for i in range(len(X.columns)):
        vif_data.loc[i,'VIF'] = variance_inflation_factor(X.values, i)

    print(vif_data)
vif(X)
          feature           VIF
0           const  2.419858e+06
1      MSSubClass  1.480034e+00
2         LotArea  1.342832e+00
3     OverallQual  3.438243e+00
4     OverallCond  1.601301e+00
5       YearBuilt  3.965491e+00
6    YearRemodAdd  2.223979e+00
7      BsmtFinSF1           inf
8      BsmtFinSF2           inf
9       BsmtUnfSF           inf
10    TotalBsmtSF           inf
11     FirstFlrSF           inf
12    SecondFlrSF           inf
13   LowQualFinSF           inf
14      GrLivArea           inf
15   BsmtFullBath  2.242659e+00
16   BsmtHalfBath  1.140692e+00
17       FullBath  2.911473e+00
18       HalfBath  2.190673e+00
19   BedroomAbvGr  2.199007e+00
20   KitchenAbvGr  1.624357e+00
21   TotRmsAbvGrd  4.917930e+00
22     Fireplaces  1.556896e+00
23     GarageCars  5.565460e+00
24     GarageArea  5.397997e+00
25     WoodDeckSF  1.253248e+00
26    OpenPorchSF  1.219440e+00
27  EnclosedPorch  1.242530e+00
28       SsnPorch  1.032740e+00
29    ScreenPorch  1.127507e+00
30       PoolArea  1.095423e+00
31        MiscVal  1.032655e+00
32         MoSold  1.058330e+00
33         YrSold  1.048841e+00
C:\Users\akl0407\Anaconda3\lib\site-packages\statsmodels\stats\outliers_influence.py:193: RuntimeWarning: divide by zero encountered in double_scalars
  vif = 1. / (1. - r_squared_i)
#Remove collinear predictors one at a time
remove = ['BsmtFinSF1']
pred_filter = [x for x in predictors if x not in remove]
X = train[pred_filter]

#Recompute VIF
vif(X)
          feature           VIF
0           const  2.419858e+06
1      MSSubClass  1.480034e+00
2         LotArea  1.342832e+00
3     OverallQual  3.438243e+00
4     OverallCond  1.601301e+00
5       YearBuilt  3.965491e+00
6    YearRemodAdd  2.223979e+00
7      BsmtFinSF2  1.131439e+00
8       BsmtUnfSF  2.577553e+00
9     TotalBsmtSF  5.130743e+00
10     FirstFlrSF           inf
11    SecondFlrSF           inf
12   LowQualFinSF           inf
13      GrLivArea           inf
14   BsmtFullBath  2.242659e+00
15   BsmtHalfBath  1.140692e+00
16       FullBath  2.911473e+00
17       HalfBath  2.190673e+00
18   BedroomAbvGr  2.199007e+00
19   KitchenAbvGr  1.624357e+00
20   TotRmsAbvGrd  4.917930e+00
21     Fireplaces  1.556896e+00
22     GarageCars  5.565460e+00
23     GarageArea  5.397997e+00
24     WoodDeckSF  1.253248e+00
25    OpenPorchSF  1.219440e+00
26  EnclosedPorch  1.242530e+00
27       SsnPorch  1.032740e+00
28    ScreenPorch  1.127507e+00
29       PoolArea  1.095423e+00
30        MiscVal  1.032655e+00
31         MoSold  1.058330e+00
32         YrSold  1.048841e+00
C:\Users\akl0407\Anaconda3\lib\site-packages\statsmodels\stats\outliers_influence.py:193: RuntimeWarning: divide by zero encountered in double_scalars
  vif = 1. / (1. - r_squared_i)
#Remove another collinear predictor
remove = ['BsmtFinSF1','FirstFlrSF']
pred_filter = [x for x in predictors if x not in remove]
X = train[pred_filter]

#Recompute VIF
vif(X)
          feature           VIF
0           const  2.419858e+06
1      MSSubClass  1.480034e+00
2         LotArea  1.342832e+00
3     OverallQual  3.438243e+00
4     OverallCond  1.601301e+00
5       YearBuilt  3.965491e+00
6    YearRemodAdd  2.223979e+00
7      BsmtFinSF2  1.131439e+00
8       BsmtUnfSF  2.577553e+00
9     TotalBsmtSF  5.130743e+00
10    SecondFlrSF  6.338090e+00
11   LowQualFinSF  1.145700e+00
12      GrLivArea  1.085217e+01
13   BsmtFullBath  2.242659e+00
14   BsmtHalfBath  1.140692e+00
15       FullBath  2.911473e+00
16       HalfBath  2.190673e+00
17   BedroomAbvGr  2.199007e+00
18   KitchenAbvGr  1.624357e+00
19   TotRmsAbvGrd  4.917930e+00
20     Fireplaces  1.556896e+00
21     GarageCars  5.565460e+00
22     GarageArea  5.397997e+00
23     WoodDeckSF  1.253248e+00
24    OpenPorchSF  1.219440e+00
25  EnclosedPorch  1.242530e+00
26       SsnPorch  1.032740e+00
27    ScreenPorch  1.127507e+00
28       PoolArea  1.095423e+00
29        MiscVal  1.032655e+00
30         MoSold  1.058330e+00
31         YrSold  1.048841e+00

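After removing two predictors (BsmtFinSF1 and FirstFlrSF), every VIF is below 15, so there is no more multicollinearity. Removing only one predictor still leaves infinite VIFs, so the minimum number of predictors to be removed is 2.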
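An infinite VIF arises when a predictor is an exact linear combination of other predictors, since then \(R^2_{X_j|X_{-j}} = 1\). The sketch below reproduces this with simulated, hypothetical columns in which a total equals the sum of its parts. That is consistent with the pattern seen in the basement and floor-area columns, but it is an assumption about the data rather than something verified here. The `vif` helper is a simplified stand-in, not the statsmodels function.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the others."""
    y = X[:, j]
    others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r_squared = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return np.inf if r_squared >= 1 else 1 / (1 - r_squared)

rng = np.random.default_rng(0)
parts = rng.uniform(0, 1000, size=(200, 3))  # three hypothetical area columns
total = parts.sum(axis=1)                    # exact sum of the three parts

X_all = np.column_stack([parts, total])
X_dropped = X_all[:, 1:]                     # drop the first component

print(vif(X_all, 3))      # extremely large or inf: total is an exact sum
print(vif(X_dropped, 2))  # finite once the exact dependence is broken
```

Dropping any one column from an exactly dependent group is enough to make the remaining VIFs finite, which is why one removal per dependent group suffices.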

G.28 Lasso

Develop a lasso regression model to predict sale price based on all the predictors (except Id) in housing_train.csv. Find the RMSE (root mean squared error) of the developed model on housing_test.csv. Round up your answer to the nearest 100 greater than the answer. For example, if the RMSE is 1001, enter 1100 in the box.

Note: Use this range of tuning parameter to find its optimal value:

alphas = 10**np.linspace(0,-4,200)*0.5

X = train[predictors]
#Test dataset
Xtest = test[predictors]
#Log-transforming the response
y = np.log(train.SalePrice)
#Standardizing train and test predictors using the scaler fitted on train data
scaler = StandardScaler()
scaler.fit(X)
Xstd = scaler.transform(X)
Xtest_std = scaler.transform(Xtest)
alphas = 10**np.linspace(0,-4,200)*0.5

lassocv = LassoCV(alphas = alphas, cv = 10, max_iter = 100000)
lassocv.fit(Xstd, y)

#Optimal value of the tuning parameter - lambda
lassocv.alpha_
0.005359456596025638
#Using the developed lasso model to predict on test data
lasso = Lasso(alpha = lassocv.alpha_)
lasso.fit(Xstd, y)
Lasso(alpha=0.005359456596025638)
pred=np.exp(lasso.predict(Xtest_std))
np.sqrt(((pred-test.SalePrice)**2).mean())
25296.540649657465
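The rounding rule in the question (round up to the next multiple of 100) can be written as a short helper. The function name is hypothetical.

```python
import math

def round_up_to_hundred(value):
    """Smallest multiple of 100 that is greater than or equal to value."""
    return 100 * math.ceil(value / 100)

print(round_up_to_hundred(1001))                # 1100, the example in the question
print(round_up_to_hundred(25296.540649657465))  # 25300
```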

G.29 Predictor importance

Which predictor is the most important in predicting sale price based on the lasso regression model (developed in Q28)?

Hint: Find the predictor with the highest magnitude of coefficient.

X.columns[np.argmax(np.abs(lasso.coef_))]
'OverallQual'

G.30 Improving model fit

Remove the influential points from the train data housing_train.csv. Re-develop the lasso regression model, and compute the RMSE (root mean squared error) of the developed model on housing_test.csv. Round up your answer to the nearest 100 greater than the answer. For example, if the RMSE is 1001, enter 1100 in the box.

Note: Assume that a data point having a leverage more than 4 times the average leverage and a studentized residual with a magnitude of more than 3 is an influential point.

model_log = sm.ols('np.log(SalePrice)~' + '+'.join(predictors), data = train).fit()
out = model_log.outlier_test()

#Average leverage of points
average_leverage = (model_log.df_model+1)/model_log.nobs
average_leverage

#Computing the leverage statistic for each observation
influence = model_log.get_influence()
leverage = influence.hat_matrix_diag

#Threshold for high leverage: 4 times the average leverage
high_leverage_threshold = 4*average_leverage

#Number of high leverage points in the dataset
np.sum(leverage>high_leverage_threshold)
15
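The average leverage used above, \((p+1)/n\), follows from the fact that the leverages are the diagonal of the hat matrix, whose trace equals the number of estimated coefficients. A minimal sketch on simulated data (a hypothetical design matrix, not the housing data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

# Design matrix with an intercept column and p simulated predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Leverage of each observation: diagonal of the hat matrix X (X'X)^{-1} X'
hat = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(hat)

average_leverage = (p + 1) / n
print(np.isclose(leverage.mean(), average_leverage))  # True

# Number of observations exceeding 4 times the average leverage
print(np.sum(leverage > 4 * average_leverage))
```

Note that high leverage alone does not make a point influential under the definition in this question; the point must also have a studentized residual with a magnitude of more than 3.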
#Dropping influential points from data
train_filtered = train.drop(np.intersect1d(np.where(np.abs(out.student_resid)>3)[0],
                                           (np.where(leverage>high_leverage_threshold)[0])))
X = train_filtered[predictors]
#Log-transforming the response
y = np.log(train_filtered.SalePrice)
#Standardizing train and test predictors using the scaler fitted on the filtered train data
scaler = StandardScaler()
scaler.fit(X)
Xstd = scaler.transform(X)
Xtest_std = scaler.transform(Xtest)
alphas = 10**np.linspace(0,-4,200)*0.5

lassocv = LassoCV(alphas = alphas, cv = 10, max_iter = 100000)
lassocv.fit(Xstd, y)

#Optimal value of the tuning parameter - lambda
lassocv.alpha_
0.005117057010527266
#Using the developed lasso model to predict on test data
lasso = Lasso(alpha = lassocv.alpha_)
lasso.fit(Xstd, y)
Lasso(alpha=0.005117057010527266)
pred=np.exp(lasso.predict(Xtest_std))
np.sqrt(((pred-test.SalePrice)**2).mean())
22684.858174164117
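To see how much removing the influential points helped, the two test RMSE values reported in this appendix can be compared directly. A short sketch of the arithmetic, using the rounding rule stated in the question:

```python
import math

rmse_before = 25296.540649657465  # lasso trained on all training observations
rmse_after = 22684.858174164117   # lasso trained without the influential points

# Round up to the next multiple of 100, as the question asks
answer = 100 * math.ceil(rmse_after / 100)

# Relative reduction in test RMSE, in percent
reduction = 100 * (rmse_before - rmse_after) / rmse_before

print(answer)               # 22700
print(round(reduction, 1))  # 10.3
```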